The HVT package is a collection of R functions for building topology-preserving maps of rich multivariate data, particularly datasets with a very large number of rows. The typical workflow supported by these functions is organized below:
This step installs and loads all the packages needed to run this notebook without errors.
packages <- c("dplyr","magrittr", "ggplot2", "plyr", "patchwork", "plotly", "gganimate", "magick", "kableExtra", "DT", "htmltools")
pkgInstall <- function(x){
  # install the package if it is not already present
  if (!(x %in% rownames(installed.packages()))) {
    install.packages(x, dependencies = TRUE)
    if (!require(x, character.only = TRUE)) stop(paste(x, "package not found"))
  }
  # load the package
  do.call("require", list(x))
}
packageStatus <- lapply(packages, FUN=pkgInstall)
# sourcing the updated R scripts
script_dir <- "./R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
for (file in r_files) {
source(file)
}
The Lorenz system is a classic example of a deterministic chaotic system: although it is governed by a fixed set of equations, even a slight alteration in the initial conditions can lead to a substantial divergence in its outcomes. The Lorenz Attractor, the dataset used in this notebook, represents the solution of the Lorenz system. A 3D simulation of this system can be observed here. It takes the form of an overlapping set of spirals resembling a butterfly.
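To make the system concrete, here is a minimal sketch of the Lorenz equations in base R, using simple Euler integration with the classic parameters (sigma = 10, rho = 28, beta = 8/3). The step size, iteration count, and initial conditions are illustrative assumptions, not the settings used to generate the notebook's dataset.

```r
# Euler integration of the Lorenz system:
#   dx/dt = sigma * (y - x)
#   dy/dt = x * (rho - z) - y
#   dz/dt = x * y - beta * z
lorenz <- function(n = 10000, dt = 0.01,
                   sigma = 10, rho = 28, beta = 8/3,
                   x0 = 1, y0 = 1, z0 = 1) {
  x <- y <- z <- numeric(n)
  x[1] <- x0; y[1] <- y0; z[1] <- z0
  for (i in 2:n) {
    x[i] <- x[i-1] + dt * sigma * (y[i-1] - x[i-1])
    y[i] <- y[i-1] + dt * (x[i-1] * (rho - z[i-1]) - y[i-1])
    z[i] <- z[i-1] + dt * (x[i-1] * y[i-1] - beta * z[i-1])
  }
  data.frame(X = x, Y = y, Z = z)
}

trajectory <- lorenz()
# plot(trajectory$X, trajectory$Z, type = "l")  # traces the butterfly shape
```

For production use, an adaptive ODE solver (e.g. the deSolve package) would be more accurate than fixed-step Euler; this sketch only illustrates the dynamics.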
The dataset has 200,000 rows and 5 columns, which are explained below:
The dataset is loaded from a CSV file, and only the first 15,000 rows are retained for the functions that follow. This subsetting is necessary because running trainHVT() with the parameters hvt_validation and diagnose set to TRUE on the full dataset consumes excessive memory; to avoid this, the data is subset and the columns X, Y & Z are selected.
data_raw <- read.csv("./lorenze_attractor.csv")
data <- data_raw[1:15000,-c(1)]
data_hvt <- data_raw[1:15000,-c(1,5,6)]
######################PREVIEW###############
data_preview <- data_raw[1:10,-1]
datatable(data_preview,
options = list(pageLength = 10,
dom = 'tp',
scrollX = TRUE,
scrollY = '400px'
)
)
The dataset is split into train and test with 60% and 40% respectively.
##split the dataset into train and test sets
set.seed(123)
##the split percentage considered here is 60-40.
split_ratio <- 0.6
##the data is split sequentially (no shuffling), preserving the temporal order of the series
split_point <- round(split_ratio * nrow(data))
# Create train and test sets
train_data <- data[(1:split_point), ]
train_data <- train_data[,-c(4,5)]
test_data_fm <- data[((split_point + 1):nrow(data)), ]
test_data <- test_data_fm[,-c(4,5)]
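The sequential 60/40 split above can be illustrated on a small toy data frame; this is a self-contained sketch of the same logic, with a synthetic series standing in for the Lorenz data.

```r
# Illustrative sequential 60/40 split (no shuffling), mirroring the
# notebook's approach so the temporal order of the series is preserved.
toy <- data.frame(t = 1:100, value = rnorm(100))
split_ratio <- 0.6
split_point <- round(split_ratio * nrow(toy))
train <- toy[1:split_point, ]
test  <- toy[(split_point + 1):nrow(toy), ]
stopifnot(nrow(train) == 60, nrow(test) == 40)
```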
This step compresses the rows of a long data frame using a compression objective: vector quantization (VQ) or hierarchical vector quantization (HVQ), with centroids computed as means or medians.
The output of this function is a nested list object.
hvt_mapA <- trainHVT(train_data, n_cells = 100, depth = 1, quant.err = 0.2,
                     distance_metric = "L1_Norm", error_metric = "mean",
                     projection.scale = 10, normalize = TRUE, seed = 123,
                     quant_method = "kmeans", hvt_validation = TRUE,
                     diagnose = TRUE, train_validation_split_ratio = 0.8)
NOTE: The output will be stored in the R environment.
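As a quick sanity check, the nested-list structure of the stored output can be inspected with str(); this sketch assumes trainHVT() has already run in the session (the third element's scale_summary is accessed later in this notebook).

```r
# Inspect the top levels of the nested list returned by trainHVT()
# without printing the full (large) object.
str(hvt_mapA, max.level = 1)

# The scaling information used later when training map C:
hvt_mapA[[3]]$scale_summary
```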
This function constructs hierarchical Voronoi tessellations and can render them as a 1D plot, a 2D plot, or an interactive surface plot.
plotHVT(hvt_mapA, heatmap = '1D')
plotHVT(hvt_mapA, line.width = c(1.2), color.vec = c('#000000'), pch = 21, centroid.size = 1,
maxDepth = 1,heatmap = '2Dhvt')
2D heatmap of the column 'n', the number of data points inside each cell, from the trainHVT results.
plotHVT(hvt_mapA, data, centroid.size = 1,
        child.level = 1, hmap.cols = "n",
        line.width = c(0.6), color.vec = c('#000000'),
        pch1 = 21, heatmap = '2Dheatmap')
Interactive surface plot with the column 'n', the number of data points inside each cell, from the trainHVT results on the z-axis.
plotHVT( hvt_mapA, child.level = 1, hmap.cols = "n", n_cells.hmap = NULL,
layer_opacity = c(0.7,0.8), dim_size = 1000, heatmap = 'surface_plot' )
This step scores new data sets and records their cell assignments using the map objects from the steps above, applying a sequence of maps if required.
score_var <- scoreHVT(test_data, hvt_mapA, mad.threshold = 1.2,
                      distance_metric = "L1_Norm", error_metric = "mean",
                      normalize = TRUE, seed = 249, child.level = 1)
Let’s see which cell and level each point belongs to, for each of the 6000 records.
###########table display################
data_scored <- score_var$actual_predictedTable
rownames(data_scored) <- NULL
datatable(data_scored,
options = list(pageLength = 10,
dom = 'tp',
scrollX = TRUE,
scrollY = '400px'
)
)
This function plots the manually identified outlier cells in the 2D HVT plot. (Here the four cells whose segment.child values are 9, 35, 58, and 95 are selected for the sake of illustrating the function.)
identified_Novelty_cells <- c(9, 35, 58, 95)
plotNovelCells(identified_Novelty_cells, hvt_mapA, line.width = c(0.6),
               color.vec = c("#000000"), pch1 = 21, centroid.size = 0.5,
               title = "Identified Novelty Cells in 2D HVT plot", maxDepth = 1)
This function removes the manually identified outlier cells from the parent dataset. The output of this function is a list containing the data points from the novelty cells and the dataset with those points removed:
output_list <- removeNovelty(identified_Novelty_cells,hvt_mapA)
The scoreLayeredHVT function is used to score the test dataset using the predictive set of maps. This function takes as input a test dataset and a set of maps (map A, map B, map C).
############DATA WITH NOVELTY######################
data_with_novelty <- output_list[[1]] %>% dplyr::select(!c("Cell.ID", "Cell.Number"))
############DATA WITHOUT NOVELTY######################
data_without_novelty <- output_list[[2]]
####################HVT-MAP-B################################
hvt_mapB <- trainHVT(data_with_novelty, n_cells = 50, depth = 1, quant.err = 0.2,
                     distance_metric = "L1_Norm", error_metric = "mean",
                     normalize = TRUE, quant_method = "kmeans", diagnose = FALSE)
##########################HVT-MAP-C######################
mapA_scale_summary <- hvt_mapA[[3]]$scale_summary
hvt_mapC <- trainHVT(data_without_novelty, n_cells = 70, depth = 1, quant.err = 0.2,
                     distance_metric = "L1_Norm", error_metric = "mean",
                     normalize = TRUE, quant_method = "kmeans",
                     scale_summary = mapA_scale_summary)
##output
data_predictions_scored <- scoreLayeredHVT(test_data, hvt_mapA, hvt_mapB, hvt_mapC,
                                           normalize = TRUE)
Let’s see which cell and layer each point belongs to for the 6000 records.
The output of scoreLayeredHVT contains two columns, Layer1.Cell.ID and Layer2.Cell.ID. Layer1.Cell.ID holds cell IDs from map A (A1, A2, A3, ...), while Layer2.Cell.ID holds cell IDs from map B (B1, B2, ...) for points that fall in the identified novelty cells, and from map C (C1, C2, C3, ...) otherwise.
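A quick way to see how the test points distribute across the layered maps is to tally the letter prefix of the cell IDs described above; this sketch assumes the scoreLayeredHVT output from the previous chunk is in the environment, with the column names shown in this notebook.

```r
# Tally test points by the map that scored them, using the
# letter prefix (A/B/C) of the layered cell IDs.
scored <- data_predictions_scored$actual_predictedTable
table(substr(scored$Layer1.Cell.ID, 1, 1))  # map A only
table(substr(scored$Layer2.Cell.ID, 1, 1))  # split between maps B and C
```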
##########################TABLE DISPLAY########################
data_scored_layer <- data_predictions_scored$actual_predictedTable
rownames(data_scored_layer) <- NULL
datatable(data_scored_layer,
options = list(pageLength = 10,
dom = 'tp',
scrollX = TRUE,
scrollY = '400px'
)
)
HVT model diagnostics are used to evaluate the model fit and investigate the proximity between centroids. Model validation is used to measure the fit/quality of the model. Measuring model fit is the key to iteratively improving the models.
The plotModelDiagnostics() function for HVT Model provides 5 diagnostic plots when diagnose = TRUE which are as follows:
plot_diag <- plotModelDiagnostics(hvt_mapA)
plot_diag[[1]]
plot_diag[[2]][[1]]
plot_diag[[2]][[2]]
plot_diag[[3]][[1]]
plot_diag[[3]][[2]]
Setting the hvt_validation parameter to TRUE yields the Mean Absolute Deviation plot for the validation (test) data.
plot_diag[[4]]
This function produces histograms of the distribution of quantization error (QE) values for both the training and test datasets, marking the mean values with dashed lines for quick evaluation.
plotQuantErrorHistogram(hvt_mapA, score_var)